Cardio Good Fitness Assignment 1

This notebook presents a preliminary exploratory data analysis (EDA) of the CardioGoodFitness dataset. The functions used below explore the dataset and extract basic observations about the data.

At the end of this exercise, I'll generate a set of insights and recommendations that will help the company target new customers.

This Exploratory Data Analysis will be divided into 3 sections:

  1. Univariate analysis
  2. Bivariate analysis
  3. Insights and recommendations

Introduction

The following command installs the latest version of pandas-profiling, a Python library used for basic exploratory data analysis (EDA).

Note: the code before the pip command ensures that I am running the pip version associated with the current Python kernel.
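
As a minimal sketch of the idea in the note above: `sys.executable` holds the path of the Python interpreter behind the current kernel, so running pip through it installs into the same environment the notebook imports from. The helper names `pip_install_command` and `install` are my own, not part of any library, and the command is only printed here rather than executed.

```python
import subprocess
import sys


def pip_install_command(package: str) -> list:
    """Build a pip command tied to the interpreter running this kernel."""
    # "-m pip" runs the pip module of exactly this interpreter.
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]


def install(package: str) -> None:
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(package))


# Only printed, so that nothing is installed by accident.
print(pip_install_command("pandas-profiling"))
```

Calling `install("pandas-profiling")` would then perform the actual installation.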

Once the package pandas-profiling is installed, I import the numpy, pandas, matplotlib and seaborn packages, as well as the ProfileReport class from the pandas_profiling package.

The following command reads the CSV file containing our data and assigns it to the variable 'data'. The head method is then used to show the first rows of the data.

The following command gives brief information about the dataset, including the dataframe size and structure.
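
The loading and inspection steps can be sketched as follows. To keep the example self-contained, it reads three made-up rows from an in-memory string instead of the real file; the rows, the column order and the product labels other than TM798 are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are NOT taken from the real file.
toy_csv = io.StringIO(
    "Product,Age,Gender,Education,MaritalStatus,Usage,Fitness,Income,Miles\n"
    "TM195,22,Male,14,Single,3,3,35000,75\n"
    "TM498,30,Female,16,Partnered,3,3,50000,95\n"
    "TM798,28,Male,18,Partnered,5,5,90000,180\n"
)

# With the real data this would be pd.read_csv("<path to the csv file>").
data = pd.read_csv(toy_csv)

print(data.head())  # first rows (up to five by default)
data.info()         # column names, dtypes, non-null counts, memory usage
print(data.shape)   # (number of rows, number of columns)
```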

1. Exploratory Data Analysis (EDA)

Dataset Profile

The following command generates a report with the ProfileReport class from the pandas_profiling package and assigns it to the object 'profile'.

The following code calls the method to_widgets() on the object 'profile' to render the report as an interactive widget in the notebook, with one tab per section.
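
A hedged sketch of these two steps is shown below. The wrapper `build_profile` and the report title are my own additions. The import sits inside the function so that the cell still loads when pandas-profiling is not installed (as far as I know, newer releases of the package are published under the name ydata-profiling).

```python
import pandas as pd


def build_profile(df: pd.DataFrame, title: str = "Cardio Good Fitness profile"):
    """Return a ProfileReport for df (requires the pandas-profiling package)."""
    # Imported lazily so that defining this function never fails on import.
    from pandas_profiling import ProfileReport

    return ProfileReport(df, title=title)


# In a notebook, with a DataFrame named `data` already loaded:
# profile = build_profile(data)
# profile.to_widgets()            # interactive, tabbed widget view
# profile.to_file("report.html")  # standalone HTML report
```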

The Overview tab gives a summary of the dataset, with the number of observations and variables.

The Variables tab gives a univariate analysis of each variable.

  1. If the variable is categorical, such as Product, Gender, MaritalStatus and Fitness, a descriptive statistics table and a horizontal bar plot are displayed to show the frequency of each category.
  2. If the variable is continuous, such as Age, Education, Usage, Income and Miles, a descriptive statistics table and a histogram are displayed to show the distribution.

The Correlations tab shows a correlation heatmap of the continuous variables.

The Missing values tab shows a bar plot of the frequency of missing values for each variable. There are no missing values in this dataset.

The Sample tab shows the first and last 10 rows of the dataset, similar to head() and tail().

Visualizing ALL Numerical Variables at Once

The first line of the following code creates a list of numerical variables. Strictly speaking, 'Fitness' is an ordinal variable, but it can be treated as either a categorical or a continuous variable, for example in a regression.

The second line creates a 2 x 3 grid of histograms of the numerical variables with the 'hist' method from the pandas package.
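
The two lines described above can be sketched like this. The six rows are made-up toy values, not the real data, and the figure size and bin count are my own choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up toy rows for illustration only.
data = pd.DataFrame({
    "Age": [22, 25, 30, 35, 41, 28],
    "Education": [14, 16, 16, 18, 14, 16],
    "Usage": [3, 4, 3, 5, 2, 4],
    "Fitness": [3, 3, 4, 5, 2, 3],
    "Income": [35000, 48000, 52000, 90000, 40000, 61000],
    "Miles": [75, 95, 106, 200, 47, 120],
})

# Line 1: the list of numerical variables.
numericals = ["Age", "Education", "Usage", "Fitness", "Income", "Miles"]

# Line 2: one histogram per column, arranged in 2 rows and 3 columns.
axes = data[numericals].hist(layout=(2, 3), figsize=(12, 6), bins=5)
```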

Continuous Variables - Outliers

The following shows boxplots of the dataset's continuous variables, with the median, interquartile range (IQR) and outliers.

Observation from boxplots: the variables Income and Miles have many outliers, shown here as blue circles. If statistical modeling were to be applied, such as a multiple regression with 'Miles' as the outcome variable, these outliers would need to be handled, for example by imputation or deletion.
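
To make the notion of an outlier concrete, here is a small sketch of the 1.5 x IQR rule, which is the default whisker rule of matplotlib boxplots. The function name `iqr_outliers` and the sample values are assumptions of mine; the values are made up and do not come from the dataset.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up toy values for illustration only.
miles = pd.Series([47, 75, 85, 95, 106, 120, 360], name="Miles")

mask = iqr_outliers(miles)
print(miles[mask])   # points a boxplot would draw as circles
print(miles[~mask])  # deletion: keep only the non-outlying points
```

Deletion is only one option; whether to delete, cap or impute depends on the model that follows.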

Visualizing ALL Categorical Variables at Once

The following code creates a list of categorical variables called 'categoricals', creates a subplot for each categorical variable, and loops through them to draw each plot with the pandas plot function, with options for color and x-tick rotation.
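
A sketch of that loop, under the assumption that the list holds Product, Gender and MaritalStatus (the original list may also include Fitness). The rows are made up, and the color and rotation values are arbitrary choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy rows for illustration only.
data = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM798", "TM798", "TM498"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "MaritalStatus": ["Single", "Partnered", "Partnered",
                      "Partnered", "Single", "Single"],
})

categoricals = ["Product", "Gender", "MaritalStatus"]

# One subplot per categorical variable, side by side.
fig, axes = plt.subplots(1, len(categoricals), figsize=(12, 4))
for column, ax in zip(categoricals, axes):
    counts = data[column].value_counts()  # frequency of each category
    counts.plot(kind="bar", ax=ax, color="steelblue", rot=45, title=column)
    ax.set_ylabel("Customers")
fig.tight_layout()
```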

2. Bivariate Analysis and Observations

This section will include informative visualizations to answer some questions about the dataset.

Self-rated Fitness Scores and Gender

Mosaic plots make it possible to visualize multivariate categorical data in an informative way. In this plot, I look at how the self-rated fitness score of the customer relates to gender.

I first import the mosaic function from statsmodels and then draw a mosaic plot with color properties: red for Male and blue for Female.

Observation from Mosaic Plot: The mosaic plot shows that about 50% of customers, males and females equally, give themselves a self-rated fitness score of 3, meaning these customers rate themselves as quite fit. Of those customers who rate themselves very fit (5), most are men. From a psychological point of view, women may tend to be harder on themselves when it comes to fitness, so we need to keep this in mind.
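
One possible way to write this is sketched below. The helper names and the eight toy rows are mine and are not the real data. The counts are passed to `mosaic` as a dictionary keyed by (fitness score, gender), which is one of the input forms the function accepts. The statsmodels import sits inside the drawing function so that the counting part runs even without statsmodels installed.

```python
import pandas as pd

# Made-up toy rows for illustration only.
toy = pd.DataFrame({
    "Fitness": [3, 3, 3, 5, 5, 2, 4, 3],
    "Gender": ["Male", "Female", "Male", "Male",
               "Male", "Female", "Female", "Female"],
})


def contingency(df: pd.DataFrame) -> dict:
    """Counts keyed by (fitness score as text, gender)."""
    sizes = df.groupby(["Fitness", "Gender"]).size()
    return {(str(score), gender): int(n)
            for (score, gender), n in sizes.items()}


def draw_fitness_gender_mosaic(df: pd.DataFrame):
    """Draw the mosaic plot; requires statsmodels and matplotlib."""
    from statsmodels.graphics.mosaicplot import mosaic

    def colour(key):
        # key is a tuple such as ("3", "Male")
        return {"color": "red" if "Male" in key else "blue"}

    return mosaic(contingency(df), properties=colour,
                  title="Self-rated fitness by gender")


print(contingency(toy))
```

Calling `draw_fitness_gender_mosaic(toy)` returns the figure together with the rectangle coordinates of each tile.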

Correlations Between Continuous Variables

The following code first computes the pairwise correlation of all columns in the dataframe. Any NA values are automatically excluded, and any column with a non-numeric data type (such as MaritalStatus) is ignored.

A correlation heatmap is then created with the seaborn function 'heatmap', with some added options to display the correlation values.

Observation from Correlation Matrix Plot: From a statistical point of view, a correlation of 0.7 or above is generally considered strong. From this correlation plot, we can see that the variable "Miles" has a fairly strong correlation with "Usage" (0.76) and a strong correlation with the variable "Fitness". We can conclude from this correlation that the higher customers rate their own fitness, the more miles they run.

On the other hand, the variable "Age" shows little or no correlation with "Usage" (0.02), "Fitness" (0.06) and "Miles" (0.04), while showing a medium correlation with "Income" (r = 0.51).
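
The two steps can be sketched as follows, on made-up toy rows. One assumption differs from a plain `corr()` call: the numeric columns are selected explicitly with `select_dtypes`, because recent pandas versions no longer drop text columns silently. The color map and number format are my own choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up toy rows for illustration only.
data = pd.DataFrame({
    "Age": [22, 25, 30, 35, 41, 28],
    "Usage": [3, 4, 3, 5, 2, 4],
    "Fitness": [3, 3, 4, 5, 2, 3],
    "Income": [35000, 48000, 52000, 90000, 40000, 61000],
    "Miles": [75, 95, 106, 200, 47, 120],
    "MaritalStatus": ["Single", "Partnered", "Partnered",
                      "Partnered", "Single", "Single"],
})

# Pairwise Pearson correlations of the numeric columns only.
corr = data.select_dtypes(include="number").corr()

# annot=True writes each coefficient into its cell.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
```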

Relationship Between Age, Miles, Gender and Marital Status

The following code creates a scatterplot to look at the relationship between Age, Miles, Gender and Marital Status.

Observation from Scatterplot: this scatterplot shows the relationship between Age, Miles, Gender and Marital Status visually (rather than as a single number, as above). We can see that the bulk of customers are between 18 and 35 and run between 45 and 200 miles, with females mostly running up to 120 miles. The size of the dots, together with the green area, shows that many customers are single.
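
A hedged sketch of such a four-variable scatterplot with seaborn is given below. How the original plot maps Gender and Marital Status onto color, marker or size is not stated, so the mapping used here (color for MaritalStatus, marker style for Gender) is an assumption, and the rows are made up.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up toy rows for illustration only.
data = pd.DataFrame({
    "Age": [22, 25, 30, 35, 41, 28],
    "Miles": [75, 95, 106, 200, 47, 120],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "MaritalStatus": ["Single", "Partnered", "Partnered",
                      "Partnered", "Single", "Single"],
})

# x and y carry the two continuous variables; hue and style carry the
# two categorical ones (assumed mapping).
ax = sns.scatterplot(data=data, x="Age", y="Miles",
                     hue="MaritalStatus", style="Gender", s=80)
ax.set_title("Miles by age, marital status and gender")
```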

Relationship Between Products and Dataset Variables

In this section, the data is grouped by product to look at education, usage, fitness scores, income and miles by product. A simple scatterplot is then drawn.

Observation from simple scatterplot: from the scatterplot, the product TM798 seems to attract more affluent and more educated people. We can assume that TM798 is the latest model and therefore might be more expensive. We also know from demographics that more educated people tend to have a higher salary.
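
The grouping step can be sketched like this. The rows are made up, the mean is assumed as the summary statistic, and the choice of Education against Income for the scatterplot is a guess on my part.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up toy rows for illustration only.
data = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Education": [14, 16, 16, 16, 18, 18],
    "Usage": [3, 3, 3, 4, 5, 5],
    "Fitness": [3, 3, 3, 3, 5, 5],
    "Income": [35000, 45000, 50000, 52000, 90000, 100000],
    "Miles": [75, 85, 95, 105, 180, 200],
})

columns = ["Education", "Usage", "Fitness", "Income", "Miles"]

# One row per product, holding the mean of each selected column.
summary = data.groupby("Product")[columns].mean()
print(summary)

# One point per product: mean education against mean income.
ax = summary.plot.scatter(x="Education", y="Income")
```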

Relationship Between Products and Income

The following code displays a simple bar plot to look at income by the model number of the treadmill used by the customer.

Observation from Barplot: this barplot shows that the product TM798 is bought by more affluent people. TM798 might be the latest treadmill model and might be more expensive.

Relationship Between Marital Status and Products Used

In this section, the data is grouped by product as well as marital status, to look at age, education, usage, fitness scores, income and miles by product and marital status. A categorical bar plot is then drawn using the function 'catplot' from seaborn.

Observation from Categorical Plot: Another important piece of information can be gleaned from this bar plot. The TM798 is purchased by more partnered than single customers.
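
A sketch of the two-level grouping and the categorical plot, on made-up toy rows. The original does not say which quantity the bars show, so customer counts (`kind="count"`) are an assumption here.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up toy rows for illustration only.
data = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM798", "TM798", "TM798"],
    "MaritalStatus": ["Single", "Partnered", "Single",
                      "Partnered", "Partnered", "Single"],
    "Income": [35000, 45000, 50000, 90000, 100000, 85000],
    "Miles": [75, 85, 95, 180, 200, 160],
})

keys = ["Product", "MaritalStatus"]

# Mean income and miles for every product / marital status combination.
means = data.groupby(keys)[["Income", "Miles"]].mean()

# Number of customers in every combination.
counts = data.groupby(keys).size()
print(means)
print(counts)

# Bars per product, split by marital status; bar height = customer count.
grid = sns.catplot(data=data, x="Product", hue="MaritalStatus", kind="count")
```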

3. Recommendations

In the following section, I provide some recommendations to the owner of the Cardio Good Fitness club, based on the data I explored.

Conclusion

In this first assignment, I explored the Cardio Good Fitness dataset through univariate and bivariate analyses. I then recommended solutions to the potential owner of the fitness store.